fix(hot): incremental splitter; promote constants to BlockNumberList by rswanson · Pull Request #67 · init4tech/storage

rswanson · 2026-05-21T16:53:36Z

Addresses the four asks from the review of #66 plus the binary-search
critique posted as a follow-up.

Changes

1. Promote constants from `ShardedKey` to `BlockNumberList` (review #4)

The byte budget and the count bound are properties of the roaring
encoding, not of the key — moved them where they belong:

BlockNumberList::MAX_ENCODED_BYTES (was ShardedKey::MAX_SHARD_BYTES)
BlockNumberList::SAFE_INDICES_PER_SHARD = 67

2. Characterise the roaring treemap encoding (review #1, #2, #3)

SAFE_INDICES_PER_SHARD is documented with the per-component overhead
(treemap header, RoaringBitmap cookie, container descriptors + offset
table, container payload) and derived from the worst case of one index
per RoaringBitmap (~22 B/index).
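
The per-component overhead can be sketched as plain arithmetic. This is a minimal model, not the crate's code: the constant names and the individual byte values are assumptions based on the portable roaring serialisation layout, chosen so that they add up to the ~22 B/index and ~10 B/index figures quoted in this PR. `safe_indices` and its `budget` parameter are hypothetical helpers for illustration only.

```rust
// Assumed byte costs of the roaring treemap serialisation (illustrative).
const TREEMAP_HEADER: usize = 8; // u64 count of inner bitmaps
const BITMAP_KEY: usize = 4; // u32 high-bits key stored per inner RoaringBitmap
const BITMAP_COOKIE: usize = 8; // cookie plus container count
const CONTAINER_DESCRIPTOR: usize = 4; // u16 key + u16 (cardinality - 1)
const CONTAINER_OFFSET: usize = 4; // u32 entry in the offset table
const ARRAY_VALUE: usize = 2; // one u16 in an array container

// Cost of one index that needs its own container (10 bytes).
const PER_CONTAINER: usize = CONTAINER_DESCRIPTOR + CONTAINER_OFFSET + ARRAY_VALUE;
// Cost of one index that needs its own RoaringBitmap (22 bytes).
const PER_BITMAP: usize = BITMAP_KEY + BITMAP_COOKIE + PER_CONTAINER;

/// Worst case: every index lands in a different RoaringBitmap.
pub const fn worst_case_bytes(n: usize) -> usize {
    TREEMAP_HEADER + n * PER_BITMAP
}

/// Sparse within a single bitmap: every index lands in a different container.
pub const fn sparse_single_bitmap_bytes(n: usize) -> usize {
    TREEMAP_HEADER + BITMAP_KEY + BITMAP_COOKIE + n * PER_CONTAINER
}

/// Largest index count that fits in `budget` bytes under the worst case.
pub const fn safe_indices(budget: usize) -> usize {
    budget.saturating_sub(TREEMAP_HEADER) / PER_BITMAP
}
```

Under this model, 67 worst-case indices occupy 8 + 67 × 22 = 1482 bytes and 68 occupy 1504, so a bound of 67 is both safe and tight for any byte budget from 1482 to 1503 inclusive.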

Five characterisation tests added to signet-storage-types::int_list:

| Test | Asserts |
| --- | --- |
| `dense_run_encodes_compactly` | run-friendly best case |
| `sparse_within_single_bitmap_costs_ten_bytes_per_index` | ~10 B/index, single bitmap |
| `sparse_across_bitmaps_costs_twenty_two_bytes_per_index` | ~22 B/index, worst case |
| `safe_indices_per_shard_fits_worst_case` | bound is safe |
| `safe_indices_per_shard_is_tight` | bound is not needlessly loose |

The last two fail loudly if the constant drifts out of sync with the
encoding (e.g. on a roaring upgrade), forcing a re-derivation.

3. Replace binary search with exact incremental splitter

Per the Option A follow-up: the previous max_prefix_fitting did
log2(n) probes per shard, each probe rebuilding the full
BlockNumberList from scratch. It also had an `if mid == 1 { break; }`
branch that silently returned best = 1 without verifying that one
element actually fit.

The new chunk_by_encoded_size walks the indices once through a working
BlockNumberList, checks serialized_size() after each push, and on
overflow rolls the last index back and carries it into the next shard.
One full rebuild per shard boundary, no probing, no mid == 1 footgun.
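
The push, check, roll back loop can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `ToyList` is a hypothetical stand-in for `BlockNumberList`, its size model is the worst case (8-byte header plus 22 bytes per index), and the sketch rolls back with an in-place `pop` rather than a rebuild.

```rust
/// Hypothetical stand-in for `BlockNumberList` with a worst-case size model.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct ToyList(Vec<u64>);

impl ToyList {
    pub fn push(&mut self, idx: u64) {
        self.0.push(idx);
    }
    pub fn pop(&mut self) -> Option<u64> {
        self.0.pop()
    }
    pub fn is_empty(&self) -> bool {
        self.0.is_empty()
    }
    pub fn as_slice(&self) -> &[u64] {
        &self.0
    }
    /// Assumed cost: 8-byte header + 22 bytes per index.
    pub fn serialized_size(&self) -> usize {
        8 + 22 * self.0.len()
    }
}

/// Splits `indices` into shards that each encode to at most `max_bytes`.
/// Returns `None` if a single index does not fit on its own.
pub fn chunk_by_encoded_size(
    indices: impl IntoIterator<Item = u64>,
    max_bytes: usize,
) -> Option<Vec<ToyList>> {
    let mut shards = Vec::new();
    let mut shard = ToyList::default();
    for idx in indices {
        shard.push(idx);
        if shard.serialized_size() <= max_bytes {
            continue;
        }
        // Overflow: roll the last index back out of the working shard.
        shard.pop();
        if shard.is_empty() {
            // The index overflowed an empty shard, so it can never fit.
            return None;
        }
        // Close the full shard and carry the index into a fresh one.
        shards.push(std::mem::take(&mut shard));
        shard.push(idx);
        if shard.serialized_size() > max_bytes {
            return None;
        }
    }
    // Emit the trailing, possibly partial, shard.
    if !shard.is_empty() {
        shards.push(shard);
    }
    Some(shards)
}
```

With a 52-byte budget the toy model fits two indices per shard, so `chunk_by_encoded_size(1..=5u64, 52)` yields `[1, 2]`, `[3, 4]`, `[5]`. A budget too small for even one index returns `None` instead of silently emitting an oversized shard, which is the case the old `mid == 1` branch missed.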

4. Count-based fast path

append_to_sharded_history checks len() <= SAFE_INDICES_PER_SHARD
before calling serialized_size(), so the common case (small
appends) skips the size check entirely.
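
A minimal sketch of the fast path, assuming the size is computed lazily. `fits_in_shard` is a hypothetical helper, and the `MAX_ENCODED_BYTES` value below is a placeholder for illustration, not the crate's actual budget.

```rust
/// Placeholder budget for illustration only (not the crate's real value).
pub const MAX_ENCODED_BYTES: usize = 1500;
/// Count bound quoted in this PR.
pub const SAFE_INDICES_PER_SHARD: usize = 67;

/// Returns true if a list with `len` indices fits in one shard.
/// `serialized_size` is only evaluated when the count check is inconclusive.
pub fn fits_in_shard(len: usize, serialized_size: impl FnOnce() -> usize) -> bool {
    // `||` short-circuits: small lists never pay for the size computation.
    len <= SAFE_INDICES_PER_SHARD || serialized_size() <= MAX_ENCODED_BYTES
}
```

The count check is sufficient but not necessary: a dense list with more than 67 indices can still fit, which is why the slow path falls back to the exact size instead of splitting immediately.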

Out of scope

The layering observation (move the budget to a backend trait so
non-MDBX backends don't waste work splitting) is deferred to a separate
issue per its own author. This PR keeps MAX_ENCODED_BYTES on the
encoding type as a single shared default; a future PR can introduce a
HotKvWrite::max_value_bytes() override without churning the splitter.
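
One possible shape for that follow-up, sketched with a default trait method. Everything here is hypothetical: `KvWriteSketch` stands in for `HotKvWrite`, the two backend types are invented for illustration, and the default value is a placeholder rather than the real `MAX_ENCODED_BYTES`.

```rust
/// Placeholder for the shared default budget (illustrative value).
pub const DEFAULT_MAX_VALUE_BYTES: usize = 1500;

/// Hypothetical stand-in for the backend write trait.
pub trait KvWriteSketch {
    /// Largest value the backend accepts; defaults to the shared budget.
    fn max_value_bytes(&self) -> usize {
        DEFAULT_MAX_VALUE_BYTES
    }
}

/// A backend with a page-bound value limit keeps the default.
pub struct PageLimitedBackend;
impl KvWriteSketch for PageLimitedBackend {}

/// A backend without a value limit opts out of splitting.
pub struct UnboundedBackend;
impl KvWriteSketch for UnboundedBackend {
    fn max_value_bytes(&self) -> usize {
        usize::MAX
    }
}
```

Because the splitter only needs a byte budget, it could take `backend.max_value_bytes()` as its limit without any change to the push, check, roll back loop.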

Test plan

cargo +nightly fmt -- --check
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo clippy --workspace --all-targets --no-default-features -- -D warnings
RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps
cargo t -p signet-storage-types — five new characterisation tests pass
cargo t -p signet-hot --all-features — 103 tests pass, incl. mem_conformance which runs the new test_history_shard_fits_in_dupsort_limit regression
cargo t -p signet-hot-mdbx --all-features — 49 tests pass

🤖 Generated with Claude Code

…rList

Addresses review feedback on #66.

- Promote shard sizing constants from `ShardedKey` to `BlockNumberList`, where they belong (they are properties of the roaring treemap encoding, not the key):
  - `BlockNumberList::MAX_ENCODED_BYTES` (was `ShardedKey::MAX_SHARD_BYTES`)
  - `BlockNumberList::SAFE_INDICES_PER_SHARD = 67`, the count guaranteed to fit in `MAX_ENCODED_BYTES` under any u64 distribution, derived from the worst case of one index per `RoaringBitmap` (~22 B/index) plus the 8-byte treemap header.
- Document the roaring serialisation overhead on `SAFE_INDICES_PER_SHARD` (cookie, container descriptors, offset table, payload) so the bound is reproducible.
- Add characterisation tests in `signet-storage-types` covering dense runs, sparse-within-a-bitmap (~10 B/index), and worst-case across bitmaps (~22 B/index). The `safe_indices_per_shard_is_tight` test fails if the bound becomes loose, prompting future re-derivation.
- Replace the binary-search splitter with an exact single-pass incremental builder: push, check, roll back on overflow. Avoids the `mid == 1` footgun in the previous implementation and does one `BlockNumberList` rebuild per shard boundary instead of `log2(n)` full re-encodings per shard.
- Add a count-based fast path using `SAFE_INDICES_PER_SHARD` so the common case skips `serialized_size()` entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Self-review cleanup of the previous commit.

- Use `BlockNumberList::remove(idx)` + `clear()` to roll back the overflowing index in place, instead of rebuilding the prefix from a re-sliced `Vec<u64>`. Drops one allocation + one full re-encoding per shard boundary.
- Pass `last_shard.iter()` directly into `chunk_by_encoded_size` instead of materialising a `Vec<u64>` first. Drops the per-write N-sized allocation on the slow path.
- Collapse `shard_start` + working `shard` to a single `prev_in_shard` flag: they tracked overlapping information, and the previous slow path serialised the same data twice (once into `shard`, once into the rebuilt `prefix`).
- Add `BlockNumberList::remove` so the splitter doesn't need to reach through the tuple-struct boundary.
- Drop two redundant comments (`// Pre-sorted, ...: push cannot fail.` restated the `expect` message; `// Emit the final (open) shard.` restated the code).
- Fix `let mut last_shard = last_shard;` no-op rebind by moving `mut` into the destructuring.
- Name the conformance test's previously-bare `30` as `SHARDS_TO_FORCE: usize = 30`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rswanson and others added 2 commits May 21, 2026 12:53

rswanson wants to merge 2 commits into dylan/split-shards from swanny/pr66-feedback
